Association Rules Mining

Team composition:

  • Amihaesei Sergiu
  • Stoica George
data <- read.csv(file = 'heart.csv')
# Pre processing
data[["age"]] = data[["ï..age"]]
data = data[,!(names(data) %in% c("ï..age"))]
names(data)
##  [1] "sex"      "cp"       "trestbps" "chol"     "fbs"      "restecg" 
##  [7] "thalach"  "exang"    "oldpeak"  "slope"    "ca"       "thal"    
## [13] "target"   "age"
copy = data.frame(data)

Numerical attributes

numerical_attributes = c("age", "trestbps", "chol", "oldpeak", "thalach")
numerical_attributes
## [1] "age"      "trestbps" "chol"     "oldpeak"  "thalach"

Age

  • We observed in the previous task that there age variables seems to follow two normal distributions, so we split the age into two categorical variables
  • Age < 50 -> young
  • Age >= 50 -> old
data$age[data$age < 50] = "young"
data$age[!data$age %in% c("young")] = "old"
as.data.frame(table(data$age))

Trestbps

  • Trestbps is the resting blood pressure
  • From a medical point of view, values bellow 120 mm are normal, while those that are higher can indicate problems.
data$trestbps[data$trestbps < 120] = "normal blood pressure"
data$trestbps[!data$trestbps %in% c("normal blood pressure")] = "high blood pressure"
as.data.frame(table(data$trestbps))

Chol

  • Cholesterol values are normal if between 125 mg/dl and 200 mg/dl, otherwise they are suspicious.
  • In our case we only have halues higher than the lower required value, so we only check for values bigger than 200 mg/dl.
data$chol[data$chol < 200] = "normal chol"
data$chol[!data$chol %in% c("normal chol")] = "high chol"
as.data.frame(table(data$chol))

Thalac

  • The maximum heart rate achieved
  • According to medical expertise, the normal value should be near 220 - age. However, the method is inaccurate already at an age of 30-40 years, which is the majority of our population. So we decided to follow this table:
    • 20 years: 100 to 170 beats per minute
    • 30 years: 95 to 162 beats per minute
    • 35 years: 93 to 157 beats per minute
    • 40 years: 90 to 153 beats per minute
    • 45 years: 88 to 149 beats per minute
    • 50 years: 85 to 145 beats per minute
    • 55 years: 83 to 140 beats per minute
    • 60 years: 80 to 136 beats per minute
    • 65 years: 78 to 132 beats per minute
    • 70 years: 75 to 128 beats per minute
  • Values that are within the bounds are normal values, while higher or lower represent anomalies.
assign_value = function(index, low, high) {
  if (copy$thalach[index] < low) {
    data$thalach[index] <<- "low HR"
  } else if (copy$thalach[index] > high) {
    data$thalach[index] <<- "high HR"
  } else {
    data$thalach[index] <<- "normal HR"
  }
}

for (i in 1:length(copy$age)) {
  if (copy$age[i] < 20) {
    assign_value(i, 100, 170)
  } else if (copy$age[i] < 30) {
    assign_value(i, 95, 162)
  } else if (copy$age[i] < 35) {
    assign_value(i, 93, 157)
  } else if (copy$age[i] < 40) {
    assign_value(i, 90, 153)
  } else if (copy$age[i] < 45) {
    assign_value(i, 88, 149)
  } else if (copy$age[i] < 50) {
    assign_value(i, 85, 145)
  } else if (copy$age[i] < 55) {
    assign_value(i, 83, 140)
  } else if (copy$age[i] < 60) {
    assign_value(i, 80, 136)
  } else if (copy$age[i] < 65) {
    assign_value(i, 78, 132)
  } else {
    assign_value(i, 75, 128)
  }
}
as.data.frame(table(data$thalach))

Oldpeak

  • ST depression induced by exercise relative to rest.
  • We don’t know what that means, so we decided to split the values into 3 equal sized groups.
sorted_values = sort(data$oldpeak)
v1 = sorted_values[length(sorted_values) / 3]
v2 = sorted_values[2 * length(sorted_values) / 3]
for (i in 1:length(sorted_values)) {
  if (data$oldpeak[i] < v1) {
    data$oldpeak[i] = paste0("oldpeak [", (min(sorted_values)), ",", (v1), ")")
  } else if (data$oldpeak[i] < v2) {
    data$oldpeak[i] = paste0("oldpeak [",(v1), ",",  (v2), ")")
  } else {
    data$oldpeak[i] = paste0("oldpeak [", (v2), ",", (max(sorted_values)), "]")
  }
  
}
data$fbs = paste0("fbs", data$fbs)
data$restecg = paste0("restecg", data$restecg)
data$slope = paste0("slope", data$slope)
data$thal = paste0("thal", data$thal)
data$ca = paste0("ca", data$ca)
data$target = paste0("target", data$target)
data$exang = paste0("exang", data$exang)
data$sex = paste0("sex", data$sex)
data$cp = paste0("cp", data$cp)

We visualize the new data

data

Convert data to transactions

library(arules)
library(arulesViz)

trans <- arules::transactions(data)
trans
## transactions in sparse format with
##  303 transactions (rows) and
##  39 items (columns)

We visualize the top 20 most frequent items (that have the highest support).

We can see that the majority of the individuals (>80%) don’t have diabetes (fbs=0), but have a high cholestelor level, and high blood pressure. Also most of them are old (>50) with 70% probability.

itemFrequencyPlot(trans, topN=20)

rules_descr = c()
for (i in c(0.1, 0.2, 0.3, 0.4, 0.5)) {
  for (j in c(0.7, 0.8, 0.9, 1.0)) {
    rules <- apriori(trans, supp = i, conf = j, target = "rules", parameter = list(minlen = 2))
    rules_descr <- rbind(rules_descr, c(paste0("Supp = ", i, " Conf = ", j), capture.output(print(rules))))
  }
}

We can see that we have rules with 100% confidence only for support = 0.1.

  • We also have 1 rule with confidence of 0.9 for support = 0.4.
  • And 28 rules with confidence 0.8 for support = 0.5
rules_descr
##       [,1]                    [,2]                 
##  [1,] "Supp = 0.1 Conf = 0.7" "set of 22990 rules "
##  [2,] "Supp = 0.1 Conf = 0.8" "set of 16761 rules "
##  [3,] "Supp = 0.1 Conf = 0.9" "set of 4473 rules " 
##  [4,] "Supp = 0.1 Conf = 1"   "set of 136 rules "  
##  [5,] "Supp = 0.2 Conf = 0.7" "set of 2769 rules " 
##  [6,] "Supp = 0.2 Conf = 0.8" "set of 1984 rules " 
##  [7,] "Supp = 0.2 Conf = 0.9" "set of 244 rules "  
##  [8,] "Supp = 0.2 Conf = 1"   "set of 0 rules "    
##  [9,] "Supp = 0.3 Conf = 0.7" "set of 581 rules "  
## [10,] "Supp = 0.3 Conf = 0.8" "set of 412 rules "  
## [11,] "Supp = 0.3 Conf = 0.9" "set of 18 rules "   
## [12,] "Supp = 0.3 Conf = 1"   "set of 0 rules "    
## [13,] "Supp = 0.4 Conf = 0.7" "set of 117 rules "  
## [14,] "Supp = 0.4 Conf = 0.8" "set of 76 rules "   
## [15,] "Supp = 0.4 Conf = 0.9" "set of 1 rules "    
## [16,] "Supp = 0.4 Conf = 1"   "set of 0 rules "    
## [17,] "Supp = 0.5 Conf = 0.7" "set of 42 rules "   
## [18,] "Supp = 0.5 Conf = 0.8" "set of 28 rules "   
## [19,] "Supp = 0.5 Conf = 0.9" "set of 0 rules "    
## [20,] "Supp = 0.5 Conf = 1"   "set of 0 rules "

Important rules we find interesting:

  • If a person is male, he might also have high blood pressure
  • If a person is male, he might also have fbs = 0 (not enough blood sugar)
  • If a person is male, he might also have high chol
  • Old people might have high blood pressure
  • Old people might have high chol
  • High chol might come with high blood pressure
rules <- apriori(data, supp = 0.5, conf = 0.8, target = "rules", parameter = list(minlen = 2))
result = DATAFRAME(rules, separate=FALSE)
result[order(-result$confidence),]
plot(rules, engine = "html")
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.
plot(rules, method = "graph", engine = "html")
  • We can observe that there is one rule with support = 0.4 and conf = 0.9.
  • Slope is a variable which measures the ST depression.
  • So if a person has slope = 2, then he might also have high heart rate with a confidence of 0.9.
rules <- apriori(data, supp = 0.4, conf = 0.9, target = "rules", parameter = list(minlen = 2))
result = DATAFRAME(rules, separate=FALSE)
result[order(-result$confidence),]

For support=0.3 and confidence=0.9 we don’t have any interesting rules

rules <- apriori(data, supp = 0.3, conf = 0.9, target = "rules", parameter = list(minlen = 2))
plot(rules, engine = "html")
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.

Interpretation

We can see that 10% of the people, without chest pain (cp=0), high cholesterol, high blood pressure, and that are old, don’t have any heart problems with 100% confidence.

rules <- apriori(data, supp = 0.1, conf = 1, target = "rules", parameter = list(minlen = 2))
plot(rules, engine = "html")
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.

Interpretation:

  • When using support = 0.2 we had too many rules, so we limited ourselves to support = 0.25.
  • We can see that we have 4 rules with a high confidence.
  • All of these rules have target = 1, which is normal, because on our dataset it is easier for a set of variables to imply that a patient suffers from heart disease than for to clearly say that a patient is healthy.
targetItems = grep("^target=", itemLabels(trans), value=TRUE)
rules = apriori(trans, supp = 0.25, conf = 0.9, target = "rules", parameter = list(minlen = 2), appearance = list(rhs = targetItems))
result = DATAFRAME(rules, separate=FALSE)
result[order(-result$confidence),]
plot(rules, method = "graph", engine = "html")
  • These are rules which have a higher support but a lower confidence
rules = apriori(trans, supp = 0.4, conf = 0.7, target = "rules", parameter = list(minlen = 2), appearance = list(rhs = targetItems))
result = DATAFRAME(rules, separate=FALSE)
result[order(-result$confidence),]
plot(rules, method = "graph", engine = "html")
  • Here we have one of the few cases in which we have an association rule with target = 0.
  • We have a confidence of 0.7 to be healthy if we do not have chest pains.
targetItems = grep("^target=target0", itemLabels(trans), value=TRUE)
rules = apriori(trans, supp = 0.3, conf = 0.7, target = "rules", parameter = list(minlen = 2), appearance = list(rhs = targetItems))
result = DATAFRAME(rules, separate=FALSE)
result[order(-result$confidence),]

Conclusion:

  • From the above analysis we see that it is hard to have a set of variables in which on the right side target = 0.
  • That means that we can’t tell for sure if a person is healthy according to the clinical indices, because he might suffer from this disease due to other factors.
  • Moreover, we do not have very clear indicators with high support and high confidence. The association rules with confidence > 0.9 have a support close to 0.25, while the rules with support > 0.4 have confidence close to 0.7.
  • However, we can still create weak classifiers which use the above shown groups and we extracted some good associations
    • No chest pain means one is healthy with a confidence of 0.7 and support 0.3
    • No exercise induced angina and high HR means hearth disease with a confidence of 0.7 and support 0.4
    • exang=exang0, ca=ca0, thal=thal2 means hearth disease with a confidence of 0.9 and a support close to 0.3